A hierarchical bivariate meta-analysis of diagnostic test accuracy to provide direct comparisons of immunoassays vs. indirect immunofluorescence for initial screening of connective tissue diseases

Objectives: To compare indirect immunofluorescence (IIF) for antinuclear antibodies (ANA) against immunoassays (IAs) as an initial screening test for connective tissue diseases (CTDs). Methods: A systematic literature review identified cross-sectional or case-control studies reporting test accuracy data for IIF and enzyme-linked immunosorbent assays (ELISA), fluorescence enzyme immunoassay (FEIA), chemiluminescent immunoassay (CLIA) or multiplex immunoassay (MIA). The meta-analysis used hierarchical, bivariate, mixed-effect models with random-effects by test. Results: Direct comparisons of IIF with ELISA showed that both tests had good sensitivity (five studies, 2321 patients: ELISA: 90.3% [95% confidence interval (CI): 80.5%, 95.5%] vs. IIF at a cut-off of 1:80: 86.8% [95% CI: 81.8%, 90.6%]; p = 0.4) but low specificity, with considerable variance across assays (ELISA: 56.9% [95% CI: 40.9%, 71.5%] vs. IIF 1:80: 68.0% [95% CI: 39.5%, 87.4%]; p = 0.5). FEIA sensitivity was lower than IIF sensitivity (1:80: p = 0.005; 1:160: p = 0.051); however, FEIA specificity was higher (seven studies, n = 12,311, FEIA 93.6% [95% CI: 89.9%, 96.0%] vs. IIF 1:80 72.4% [95% CI: 62.2%, 80.7%]; p < 0.001; seven studies, n = 3251, FEIA 93.5% [95% CI: 91.1%, 95.3%] vs. IIF 1:160 81.1% [95% CI: 73.4%, 86.9%]; p < 0.0001). CLIA sensitivity was similar to IIF (1:80) with higher specificity (four studies, n = 1981: sensitivity 85.9% [95% CI: 64.7%, 95.3%]; p = 0.86; specificity 86.1% [95% CI: 78.3%, 91.4%]). More data are needed to make firm inferences for CLIA vs. IIF given the wide prediction region. There were too few studies for the meta-analysis of MIA vs. IIF (MIA sensitivity range 73.7%–86%; specificity 53%–91%). Conclusions: FEIA and CLIA have good specificity compared to IIF. A positive FEIA or CLIA test is useful to support the diagnosis of a CTD. A negative IIF test is useful to exclude a CTD.


Introduction
The presence of antinuclear antibodies (ANA) can indicate an autoimmune (AI) disease such as a connective tissue disease (CTD). International guidelines state that the diagnosis of a CTD requires a panel of tests, with the detection of ANA as the first-level screening test [1]. If the ANA screening is positive, then further steps would follow, whereby specific antibody tests are performed to definitely rule in an autoimmune rheumatic disease (ARD). A definitive diagnosis of a CTD including the specific type of CTD would be based on the diagnostic criteria for each CTD classification [2][3][4][5][6][7][8][9][10][11][12][13]. Therefore, it is important that the ANA test is accurate as this is the first stage in the diagnostic pathway, and the results will determine subsequent follow-up.
There is a broad consensus that the indirect immunofluorescence (IIF) test on human epidermoid laryngeal carcinoma cells (HEp-2 or HEp-2000 cells) is considered the "gold standard" for the detection of ANA [1,14,15]. However, as ANA can be present in sera from patients with other rheumatic diseases, in patients with nonrheumatic disorders (e.g. cancer, infection) or in healthy individuals [1,16-18], the test can have low specificity for CTD. Furthermore, IIF is a labour-intensive technique that requires highly skilled laboratory technicians to interpret the result and is therefore subject to high inter-observer variability [19][20][21]. Solid-phase immunoassays (IAs) offer an alternative to IIF. IAs have been developed to screen for specific analytes associated with CTD, and fully automated systems can overcome some of the limitations of a manual IIF mentioned earlier.
An enzyme-linked immunosorbent assay (ELISA) is a plate-based assay technique whereby an antigen is immobilized on a solid surface. Autoantibodies bind to the antigen and are complexed with an antibody linked to an enzyme. A generic ELISA detects ANA of a broad specificity by including an extract from HEp-2 cells (which can be complemented by individual autoantigens). Moreover, specific ELISAs are available that react with single autoantigens associated with CTD, such as dsDNA, SS-A/Ro, SS-B/La, Scl-70, Sm and Sm/RNP. The format of the ELISA can be modified to detect the antigen directly via a primary antibody or indirectly via a secondary antibody.
Another method is the solid-phase fluorescence enzyme immunoassay (FEIA) that is designed as a sandwich IA whereby the analyte to be measured is 'sandwiched' between an autoantigen coated to the solid phase and a detection antibody that is linked to an enzyme that produces a fluorescence signal. A commercially available automated FEIA test for CTD (EliA CTD Screen, Thermo Fisher Scientific) is coated with 15 antigens that are associated with CTDs (dsDNA, SSA/Ro 60 kDa, SSA/Ro 52 kDa, SSB/La, U1-RNP (RNP-70, A,C), Sm, Jo-1, Scl-70, centromere B, fibrillarin, RNA Pol III, PM-Scl, Mi-2, Rib-P and PCNA). The fluorescence of the reaction is measured automatically, and the higher the fluorescence intensity, the higher the antibody level in the sample.
A variation on this is the chemiluminescent immunoassay (CLIA), where the enzymes linked to the detection antibody produce a luminescence via a chemical reaction. To the best of our knowledge, there has been no systematic assessment of the diagnostic accuracy of different IAs vs. IIF to screen for ANA as an initial step towards diagnosing a CTD. A recent review has been published examining the diagnostic accuracy of two solid-phase assays (SPAs) vs. IIF based on data from seven studies [22]. However, this publication did not use meta-analysis methods to combine the data across studies and the results presented in the paper were reported for the two SPAs combined. A previous publication compared the diagnostic test accuracy of FEIA against IIF as a single test and as a double test strategy [23]. We set out to extend this analysis and assess diagnostic test accuracy for a range of IA techniques vs. IIF for ANA screening as an initial step in the diagnosis of a CTD.
To this end, we had two key objectives. The first was to conduct a comprehensive systematic literature review to identify all published studies evaluating ELISA, FEIA, CLIA or MIA vs. IIF as an initial screening test for CTD, to assess the study quality and to provide an overview of the diagnostic test accuracy data reported in these studies. A second objective was to combine the available diagnostic test accuracy data in a meta-analysis using a robust statistical method, to provide a direct comparison of the sensitivity and specificity of the different IAs vs. IIF for screening of CTD. Our review provides a better understanding of the available evidence in support of different ANA tests for CTD screening, as well as a formal and robust meta-analysis that allows for the average diagnostic accuracy and variation in test performance to be quantified.

Systematic literature review process
A structured literature search and systematic literature review were conducted as per the Cochrane Collaboration recommendations for a review of diagnostic test accuracy studies [24]. The search strategy combined search filters for CTD, index tests and diagnostic test accuracy studies using Emtree/Medical Subject Headings (MeSH) terms and free text strings. An electronic search using these filters was conducted using MEDLINE, Embase (see Supplementary Material, Table S-5) and Cochrane databases (from 2000 to March 2018) along with handsearching to identify fully paired, cross-sectional or case-control studies of ANA screening of CTD where the study reported diagnostic test accuracy for an IA of interest and IIF.
All citations retrieved from the electronic search and handsearching were imported into a reference manager (EndNote X8) for screening by two reviewers (MEO, MDO). The initial screening was based on the citation title and abstract, with a second screen using full-text papers to confirm the eligibility of the study for inclusion in the systematic review. The literature search citation flow is reported as per the PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) statement [25].

Inclusion criteria
Study design: Included studies were observational cross-sectional or cohort studies of diagnostic test accuracy. Only fully paired studies were included in the meta-analysis, i.e. studies needed to report results for an IIF and at least one IA using the same cohort of patients.
Population: To be included in the analysis, study populations needed to include a CTD cohort with a range of CTD conditions and a non-CTD diseased control (DC) group (another relevant disease) to reflect the type of patients who may be referred for ANA testing in practice.
The conditions on the CTD spectrum that are associated with the presence of ANA and are included in the CTD group include systemic lupus erythematosus (SLE) incorporating sub-acute cutaneous lupus erythematosus (ScLE); Sjögren's syndrome (SjS); systemic sclerosis (SSc) including limited scleroderma (lim SD); inflammatory myopathies (IM) such as dermatomyositis (DM) and polymyositis (PM); mixed connective tissue disease (MCTD) and undifferentiated connective tissue disease (UCTD). If a study included patients with other rheumatic diseases (e.g. rheumatoid arthritis [RA]) but included these results in the CTD cohort, then data were adjusted to account for these patients in the DC group instead. Studies that investigated one type of CTD, e.g. SLE patients only, and studies that did not include both a CTD and a control group were excluded as this does not reflect the spectrum of patients referred for ANA testing in practice. If a study included healthy controls (HCs) as part of the DC group, then these patients were excluded from the analysis, wherever this was feasible (see Table S-1). Studies that included only healthy individuals as controls were excluded from the review as test specificity in healthy patients will differ from that in diseased patients and this group does not reflect the spectrum of patients referred for ANA testing in practice. For studies that reported data pre- and post-diagnosis, data for the pre-diagnosis samples were used wherever feasible.
Index tests: All studies needed to report test performance data for both an IIF and an IA method of interest, namely an ELISA, FEIA, CLIA or MIA. Tests that are not available for use in practice or tests that have been discontinued were not included (examples of excluded tests are Bindazyme, Diastat, Varelisa, Synelisa, COBAS Core HEp-2 ANA-EIA). The cut-off for a positive ANA test can vary across studies and, for some studies, results are reported for more than one cut-off. The Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy [24] asserts that "estimating summary sensitivity and specificity by pooling studies which mix thresholds will produce an estimate that relates to some notional unspecified average of the thresholds that occur in the included studies, which is clinically unhelpful and must be avoided". To ensure comparisons reflect practice, and to avoid estimating test performance at some nominal 'average', the intention was to conduct the analysis using data for IIF at a cut-off of 1:160 as per international recommendations [1], or at 1:80, which is the new European League Against Rheumatism (EULAR)/American College of Rheumatology (ACR) entry criterion for SLE classification [26,27]. Studies that only report results for IIF at other cut-off levels were not included. For the IAs, the cut-off is as per the manufacturer's recommendations for use in practice, the cut-off for FEIA being >1. Please see our previous paper for a summary of data for IIF and FEIA including other thresholds (1:320 for IIF and 0.7 for FEIA) [23].
Reference standard: The reference standard in this review is the clinical follow-up used to definitely confirm a final diagnosis of a CTD, or definitely rule out a CTD, regardless of whether ANA were detected by the index screening test or not. To be included in the review, a study needed to include a reference standard whereby all patients had their diagnostic status confirmed (i.e. CTD or not CTD). Studies reporting concurrence between tests where the patients' diagnostic status was unknown were not included.
Outcome data: Each study needed to report the number of true positives (TPs), true negatives (TNs), false positives (FPs) and false negatives (FNs) for each test. Alternatively, the study needed to report the sensitivity and specificity of each test along with the number of patients in the CTD and DC cohorts such that test count data can be replicated from this information.
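The reconstruction of count data from reported sensitivity and specificity can be sketched as follows. This is a minimal illustration, not the procedure used in the review: the function name `counts_from_accuracy` and the example cohort sizes are hypothetical, and simple rounding to the nearest whole patient is an assumption.

```python
def counts_from_accuracy(sensitivity, specificity, n_ctd, n_dc):
    """Rebuild the 2x2 counts from reported accuracy and cohort sizes.

    sensitivity, specificity: proportions in [0, 1]
    n_ctd: number of patients with confirmed CTD
    n_dc:  number of non-CTD diseased controls
    """
    tp = round(sensitivity * n_ctd)   # CTD patients with a positive test
    tn = round(specificity * n_dc)    # controls with a negative test
    return {"TP": tp, "FN": n_ctd - tp, "TN": tn, "FP": n_dc - tn}


# Hypothetical study: 200 CTD patients and 300 diseased controls
example = counts_from_accuracy(0.85, 0.90, n_ctd=200, n_dc=300)
print(example)
```

Because published percentages are themselves rounded, counts rebuilt this way may differ from the true counts by one or two patients in each cell.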

Quality assessment
The study quality assessment was adapted from the QUADAS-2 checklist [28] to assess the quality of each study in relation to patient selection, attrition, flow and timing of the tests, and conduct and interpretation of the index tests and reference standard. As part of the assessment, the reference standard was compared to the most recent clinically accepted diagnostic criteria for CTD classification as follows: SLE: 1997 ACR criteria [2] or 2012 Systemic Lupus International Collaborating Clinics criteria [3]; SjS: 2016 ACR/EULAR criteria [4] or 2012 ACR/EULAR criteria [5]; SSc: 2013 ACR/EULAR criteria [6]; PM/DM: Bohan and Peter [7,8], Dalakas and Hohlfeld's criteria 2003 [9] or European Neuromuscular Centre criteria 2004 [10]; MCTD/UCTD: Alarcón-Segovia and Villarreal [11], Kasukawa et al. [12] or Sharp and Anderson [13]. The quality of the reference standard was graded A-E. Where the diagnosis/classification of CTD is based on the most recent disease-specific guidelines or classification criteria available at the time of the review as listed earlier, it was graded A. Grade B was used for disease-specific classification criteria that have been superseded by more recent guidelines [29][30][31][32]; grade C, where some clinical criteria and most relevant immunological criteria were used (e.g. authors indicated that disease-specific classification criteria were used but did not provide references); grade D, where some relevant clinical criteria were used (e.g. authors indicated that they used some formal criteria but did not provide references for the criteria) and grade E, referring to a reference standard that is not described with sufficient detail in the publication.

Study data summary estimates
The sensitivity of a test is defined as the probability that the index test result will be positive in a patient with CTD. The specificity of a test is defined as the probability that the index test result will be negative in non-CTD DCs. For each study, the sensitivity and specificity of each test were calculated, and the 95% confidence interval (CI) around each estimate was calculated using the exact binomial method [33]. The diagnostic odds ratio (DOR) is a summary estimate of how many times higher the odds are of obtaining a positive test result in a diseased rather than a non-diseased person. DOR can be a useful measure when comparing tests if there is no preference for either superior sensitivity or specificity and the focus is on global performance [34]. If the DOR is equal to one, the test does not discriminate between diseased and non-diseased patients and is of no clinical value; a DOR below one means that a positive result is more likely in non-diseased than in diseased patients.
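The per-study estimates described above can be illustrated with a short sketch. It assumes that the "exact binomial method" refers to the Clopper-Pearson interval, which is the usual meaning but is not stated explicitly here; the helper names and the example counts are hypothetical, and the interval is found by bisection on the binomial distribution function rather than by a statistics library.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target, iters=100):
    """Bisection on [0, 1] for a function f that decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n (moderate n only)."""
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def accuracy_summary(tp, fn, tn, fp):
    """Sensitivity, specificity, their exact CIs and the DOR for one study."""
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": exact_ci(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": exact_ci(tn, tn + fp),
        # DOR is undefined when FP or FN is zero (no continuity correction here)
        "dor": (tp * tn) / (fp * fn),
    }


# Hypothetical study counts, for illustration only
summary = accuracy_summary(tp=90, fn=10, tn=80, fp=20)
print(summary)
```

The direct summation in `binom_cdf` is adequate for cohorts of up to roughly a thousand patients; for larger cohorts a beta-quantile routine from a statistics package would be the more robust choice.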
In order to summarise all of the available data at all reported test thresholds, a hierarchical summary receiver operating characteristic (HSROC) curve was produced for each index test to provide an overall summary of the diagnostic test accuracy data. A HSROC model [35] is fitted to the study data for each test using the metandi package in STATA MP v14.2 [36]. The HSROC model is used to estimate two parameters, test accuracy (lnDOR) and asymmetry (change in DOR relative to change in the test threshold/sensitivity). The estimates from the HSROC model are used to plot a summary ROC curve of sensitivity vs. specificity (expressed as 1 − specificity), the 95% confidence region around this summary estimate and a 95% prediction region taking into account unobserved heterogeneity: if a new study was conducted, we would expect the 'true' sensitivity and specificity to lie within the prediction region with a 95% confidence level [24,36]. The prediction region can be wider than the confidence region as it goes beyond the uncertainty in the available data [34]. The results of this summary analysis are reported in the section Summary of study estimates of diagnostic accuracy by test and Figure 1.

Comparative meta-analysis methods
The meta-analysis was conducted using hierarchical, bivariate, mixed-effect models as recommended in the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy [24] and was conducted in STATA MP v14.2 using the meqrlogit command [37]. The hierarchical mixed-effect model applies statistical distributions at two levels: at a study level to account for variation within studies (differences between patients) and at a higher level to account for variation between studies. At the study level, the sensitivity and specificity are estimated directly from the TP, TN, FP and FN counts, and the model assumes that sensitivity and specificity are binomially distributed [38,39]. Furthermore, the model assumes a correlation between sensitivity and specificity modelled as a single bivariate normal distribution. This correlation models the expected trade-off between sensitivity and specificity (an increase in sensitivity is usually associated with a decrease in specificity). The mixed-effect model includes test-specific sensitivity and specificity (via dummy covariates for the test type [37]) as well as test-specific random-effects, i.e. the model has separate variance estimates for ELISAs, FEIA, MIA, CLIA and IIF to account for the variability in test assays within the test method groups.
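For orientation, the two-level structure described above can be written out in the standard form of the bivariate model. The notation is an assumption introduced here (study index i, test index j, means \mu_{A,j} and \mu_{B,j} on the logit scale), and the exact parameterisation fitted in the review, in particular whether the covariance term is test-specific, may differ from this sketch.

```latex
\begin{align}
TP_{ij} &\sim \mathrm{Binomial}\left(n^{\mathrm{CTD}}_{ij},\, Se_{ij}\right), \qquad
TN_{ij} \sim \mathrm{Binomial}\left(n^{\mathrm{DC}}_{ij},\, Sp_{ij}\right), \\
\begin{pmatrix} \operatorname{logit} Se_{ij} \\ \operatorname{logit} Sp_{ij} \end{pmatrix}
&\sim N\!\left(
\begin{pmatrix} \mu_{A,j} \\ \mu_{B,j} \end{pmatrix},\;
\begin{pmatrix} \sigma^{2}_{A,j} & \sigma_{AB,j} \\ \sigma_{AB,j} & \sigma^{2}_{B,j} \end{pmatrix}
\right)
\end{align}
```

In this form the summary sensitivity and specificity for test j are obtained by back-transforming \mu_{A,j} and \mu_{B,j} from the logit scale, and a negative \sigma_{AB,j} corresponds to the trade-off between sensitivity and specificity noted above.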
In addition to the aforementioned summary estimates, the bivariate meta-analysis estimates the likelihood ratio (LR), which is the ratio of the probability that a given test result is obtained in the CTD patients to the probability of the same result in the controls. The positive likelihood ratio (LR+) describes how many times more likely positive index test results were in the diseased group compared to the non-diseased group. The negative likelihood ratio (LR−) summarises how many times less likely negative index test results were in the diseased group compared to the non-diseased group. In order for the summary estimates to be clinically meaningful, the bivariate meta-analysis has been limited to studies where the test data are reported at a common cut-off such that the results provide an average operating point [24] as well as a 95% CI. The statistical significance of differences between tests is based on the p-value estimated from a two-sided t-test and statistically significant differences are defined as a p-value <0.05.
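The relationship between the likelihood ratios and a pair of sensitivity and specificity estimates can be sketched as below. The function name is hypothetical, and the inputs are the ELISA point estimates quoted in the abstract (sensitivity 90.3%, specificity 56.9%); the values produced this way are a simple plug-in calculation and need not match the model-based LR estimates reported in Table 2.

```python
def likelihood_ratios(sensitivity, specificity):
    """Plug-in positive and negative likelihood ratios."""
    lr_pos = sensitivity / (1 - specificity)   # P(T+ | CTD) / P(T+ | no CTD)
    lr_neg = (1 - sensitivity) / specificity   # P(T- | CTD) / P(T- | no CTD)
    return lr_pos, lr_neg


# ELISA point estimates taken from the abstract
lr_pos, lr_neg = likelihood_ratios(0.903, 0.569)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}, DOR = {lr_pos / lr_neg:.1f}")
```

The ratio LR+/LR− equals the DOR, which is why the three measures move together when tests are compared.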
For the comparative meta-analysis, we conducted separate analyses using data for IIF at thresholds of 1:80 and 1:160.

Summary of the quality of the studies included in the review
Overall the quality assessment indicated that the studies were of sufficient quality in terms of patient selection and participant flow (see Supplementary Material,

Summary of study estimates of diagnostic accuracy by test
By design, to be included in the review, all studies must be fully paired and report data for one of the IAs as well as data for IIF at a cut-off of 1:80 or 1:160. Therefore, all 17 studies [40][41][42][43][44][45][46][47][48][49][50][51][53][54][55][56][57] contributed diagnostic test accuracy data for IIF (see Supplementary Material, Table S-4). Thirteen studies [40, 42-47, 50, 51, 53, 54, 56, 57] reported data at a cut-off of 1:80, and eight studies [40, 41, 48-51, 53, 55] reported data at a cut-off of 1:160. Figure 1, top left panel, is a plot of the study estimates for IIF sensitivity vs. specificity at these thresholds (one circle per study estimate, where the size of the circle corresponds to the size of the study cohort). Whilst the sensitivity of the IIF test for CTD is good in most studies, there is a large variance in specificity across the studies. The HSROC curve (blue line) shows that an increase in IIF test sensitivity results in a marked decrease in the specificity. The prediction region (grey dashed line) indicates a high level of uncertainty in the estimates for IIF specificity: given the available data, this is the region where the 'true' sensitivity and specificity of a new study is expected to lie with a 95% confidence level.
The HSROC plot of the study estimates for ELISA sensitivity vs. specificity is similar to that for IIF: variance in the test results across the studies leads to a high level of uncertainty in the estimates of diagnostic test accuracy (a large prediction region). Figure 1, top right panel, shows 10 ELISA test results (from six studies [43,45-48,54]) with a similar association between increased sensitivity and decreased specificity as IIF. A HSROC plot of ELISA test data for ELISAs with and without a HEp-2 component is shown in Supplementary Material, Figure S-
The plots for IIF and ELISA can be contrasted with the plot of the study estimates for FEIA (Figure 1, middle left panel). Of the 17 fully paired studies in this review, 10 studies [40, 41, 44, 49-51, 53, 55-57] reported diagnostic test accuracy data for FEIA at the manufacturer's recommended cut-off of >1. As the sensitivity of the FEIA for CTD increases, there is only a small decrease in specificity (Figure 1, middle left panel: HSROC, blue line). The prediction region arising from the FEIA diagnostic test accuracy data is smaller than that for IIF and ELISA, indicating that we can be more certain of our estimates for the sensitivity and specificity of FEIA compared to our estimates for IIF or ELISA.
Of the 17 fully paired studies included in this review, four studies [42,44,45,56] reported diagnostic test accuracy data for CLIA (Figure 1, middle right panel). There is a high level of uncertainty in the estimates of diagnostic test accuracy for CLIA given the large confidence and prediction region, and that the 95% prediction region crosses the line of 'no effect' (DOR = 1). If a new CLIA study was conducted, we would expect the sensitivity and specificity to lie within the prediction region with a 95% confidence level [24,36]. Given the size of the prediction region, our estimates lack certainty, and therefore the estimates for CLIA should be viewed with caution.
Of the 17 fully paired studies included in this review, three studies [46,52,54] reported diagnostic test accuracy data for MIA (Figure 1, bottom left panel). The prediction region is similar to the prediction region for IIF, though for MIA the uncertainty in the estimates could be driven by a lack of data (HSROC analysis is underpowered) whereas the variance in the IIF and ELISA results could be due to variation in test designs, implementation of the test in practice and operator subjectivity.

Meta-analysis of ELISA vs. IIF
Five studies incorporating 2321 patients reported diagnostic test accuracy data for both IIF at a 1:80 dilution and an ELISA [43,45-47,54]. Results from the mixed-effect bivariate model (Table 2) showed no significant difference in sensitivity and specificity between IIF and ELISA. The large 95% CI in specificity for both tests indicates that the estimates are subject to uncertainty. The DOR was comparable for ELISA and IIF. Similar results (Table 2) were obtained for a sensitivity analysis restricting the analysis to three studies reporting diagnostic test accuracy data for both IIF at a 1:80 dilution and an ELISA with a HEp-2 component [43,45,46]. There were too few studies to conduct a meta-analysis for ELISA vs. IIF at a cut-off of 1:160.

Meta-analysis of FEIA vs. IIF
Two meta-analyses were conducted using a bivariate mixed-effect model and subsets of studies that reported direct comparisons of FEIA and IIF at a cut-off of 1:160 (seven studies [40, 41, 49-51, 53, 55], 3251 tests) and 1:80 (seven studies [40,44,50,51,53,56,57], 12,311 tests). Table 2 shows the sensitivity, specificity, DOR, LR+ and LR− estimated from the mixed-effect model. The sensitivity of FEIA was statistically significantly lower than the sensitivity of IIF at a cut-off of 1:80 (p = 0.005) and was lower compared to that of IIF at a cut-off of 1:160 (p = 0.051). FEIA had a significantly higher specificity than IIF at a cut-off of 1:80 (p < 0.0001) and 1:160 (p < 0.001) and the DOR was higher with FEIA compared to IIF.

Meta-analysis of CLIA vs. IIF
A meta-analysis was conducted using a bivariate mixed-effect model and data from four studies that reported data for the CLIA method [42,44,45,56]. Three of these studies report data for the same type of CLIA [42,44,56] (see Supplementary Material, Table S-4), with one study reporting data for a different CLIA with a HEp-2 extract [45]. There was no significant difference in the sensitivity of CLIA vs. IIF at a cut-off of 1:80 (p = 0.68, Table 2). CLIA had a significantly higher specificity than IIF at a cut-off of 1:80 (p = 0.01). It was noted that the model estimate for the 95% CI for CLIA sensitivity and specificity is large, indicating that the estimates are subject to uncertainty: more data are needed to determine whether the large CIs are a fair reflection of the actual variance or are due to the analysis being underpowered. Across the four studies, the reported CLIA sensitivity for CTD was as high as 98.6% (for a CLIA without a HEp-2 component [56]) and as low as 62.9% (for a CLIA with a HEp-2 component [45]), with specificity ranging from 76% [56] to 94% [42] (see Supplementary Material, Table S-4). Given the size of the prediction region, and the lack of a clinical rationale for the variance in the results, the aforementioned estimates for CLIA should be viewed with caution. There were too few studies to conduct a meta-analysis for CLIA vs. IIF at a cut-off of 1:160.

Meta-analysis of MIA vs. IIF
There were too few studies to conduct a robust meta-analysis using the bivariate mixed-effect model. Across the three studies that did report data [46,52,54], the sensitivity of MIA for CTD was as high as 86% [46] and as low as 73.7% [52] with specificity ranging from 53% [54]

Pre-test vs. post-test probability for IIF and IAs
Eleven [40][41][42][43][44][45][46][47][49][50][51] out of the 17 fully paired studies included in the review have a case-control design, and the ratio of CTD patients to non-CTD patients does not reflect the prevalence of CTD in a clinical setting. The largest prospective cross-sectional study included in this review tested 9856 consecutive patient sera submitted to the clinical laboratory for ANA testing [57]. The prevalence of CTD in the study population was estimated to be 2.7% (267/9856) including 22 cutaneous lupus patients (or 2.5% not including these patients). In 62 patients, the clinician strongly considered the presence of an ANA-associated systemic rheumatic disease (and started treatment), but the patients did not fulfil the diagnostic criteria. If these cases are included as CTD patients, then the prevalence was estimated to be 3.9% or 4.1% including cutaneous lupus. Based on the average operating point estimates from bivariate meta-analysis shown in the previous sections, the pre-test vs. post-test probability of CTD is shown in Figure 2 (after a positive test) and Figure 3 (after a negative test), assuming that the prevalence (pre-test probability) of CTD is in the range of 0-5%. It should be noted that for individual patients the pre-test probability can be higher if typical clinical signs are overt. Figure 2 indicates that a positive IIF test is more likely to indicate CTD than a positive ELISA test. Assuming an underlying CTD prevalence of 2.7% and using the average operating point estimates for sensitivity and specificity from Table 2, for every 1000 patients tested, ELISA correctly identifies 24 out of 27 patients with CTD, and 554 out of 973 patients without CTD (57.8% correctly identified). IIF at a cut-off of 1:80 has a similar sensitivity (23 out of 27 with CTD) but more TNs (662 out of 973 without CTD) such that 68.5% of patients are correctly identified by the IIF test.
The post-test probability of CTD following a positive IIF (1:80) test is 7.0% (23 TPs out of 335 positive ANA tests) and 5.5% (24 TPs out of 444 positive ANA tests) following a positive ELISA. Figure 2 indicates that a positive FEIA test is more likely to rule in a CTD than IIF at a cut-off of 1:80 (based on the average operating point estimates from Table 2). For every 1000 patients tested and a background prevalence of 2.7%, FEIA correctly identifies 21 out of 27 patients with CTD, and 911 out of 973 patients without CTD (93.2% correctly identified). The post-test probability of CTD following a positive FEIA test is 25.4% (21 TPs out of 83 positive ANA tests).
Based on the average operating point estimates from Table 2, and given that these results should be viewed with some caution, Figure 2 shows that a positive CLIA test is also more likely to rule in a CTD than IIF at a cut-off of 1:80. For every 1000 patients tested and a background prevalence of 2.7%, CLIA correctly identifies 23 out of 27 patients with CTD, and 838 out of 973 patients without CTD (86.1% correctly identified). The post-test probability of CTD following a positive CLIA test is 14.6% (23 TPs out of 158 positive ANA tests).
Assuming a hypothetical sensitivity of 81.9% and specificity of 74.6% for the MIA method (based on data from three studies [46,52,54]), then Figure 2 shows that a positive MIA test is similar to an IIF at a cut-off of 1:80 for ruling in CTD. For every 1000 patients tested and a background prevalence of 2.7%, MIA may correctly identify 22 out of 27 patients with CTD, and 727 out of 973 patients without CTD (74.9% correctly identified). The post-test probability of CTD following a positive MIA test is 8.2% (22 TPs out of 269 positive ANA tests). Figure 3 indicates that the diagnostic value of a negative FEIA and CLIA test is similar to a negative IIF (1:80) test if the prevalence of CTD is low. At a prevalence rate of 2.7%, the post-test probability of CTD following a negative IIF (1:80) test is 0.54% (four FN tests out of 665 negative tests), 0.47% (three FN tests out of 556 negative tests) following a negative ELISA, 0.63% (six FN tests out of 917 negative tests) following a negative FEIA and 0.45% (four FN tests out of 842 negative tests) following a negative CLIA. Based on the average hypothetical sensitivity and specificity for MIA stated earlier, the post-test probability of CTD following a negative MIA test is estimated to be 0.67% (five FN tests out of 731 negative tests). It should be noted that the differences between the different assays become more pronounced at higher pre-test probabilities.
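The post-test probabilities quoted in this section follow from Bayes' theorem applied to the prevalence and the average operating point. The sketch below reproduces the IIF (1:80) and ELISA figures using the sensitivity and specificity point estimates given in the abstract and the 2.7% prevalence from [57]; the function name is hypothetical, and small differences from the whole-patient counts in the text arise because the text rounds to the nearest patient per 1000 tested.

```python
def post_test_probabilities(prevalence, sensitivity, specificity):
    """Probability of CTD after a positive and after a negative test."""
    tp = prevalence * sensitivity
    fn = prevalence * (1 - sensitivity)
    tn = (1 - prevalence) * specificity
    fp = (1 - prevalence) * (1 - specificity)
    after_positive = tp / (tp + fp)   # share of positive tests that are CTD
    after_negative = fn / (fn + tn)   # share of negative tests that are CTD
    return after_positive, after_negative


PREVALENCE = 0.027  # pre-test probability of CTD reported in [57]

# (sensitivity, specificity) point estimates quoted in the abstract
operating_points = {
    "IIF 1:80": (0.868, 0.680),
    "ELISA": (0.903, 0.569),
}

results = {
    name: post_test_probabilities(PREVALENCE, se, sp)
    for name, (se, sp) in operating_points.items()
}

for name, (pos, neg) in results.items():
    print(f"{name}: {pos:.1%} after a positive test, "
          f"{neg:.2%} after a negative test")
```

Re-running the function over a range of prevalence values (for example 0-5%) gives the kind of pre-test vs. post-test curves described for Figures 2 and 3.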

Discussion
The aim was to provide direct comparisons of the sensitivity and specificity of different IAs vs. IIF for the initial screening of CTD using up-to-date evidence from published diagnostic test accuracy studies. All of the studies included in the review reported diagnostic test accuracy of an ANA test for detecting CTD, that is, the studies reported the number of patients who tested positive or negative for ANA for each assay, and the number of those patients confirmed to have a CTD or not. For FEIA vs. IIF, some studies also reported the number of patients who tested positive or negative for ANA by CTD disease type such that a disease-specific subgroup analysis could be conducted. This analysis has been published elsewhere [23]. For the other IAs considered in our review, disease-specific test performance data were reported by very few studies [45,51] such that a meta-analysis by CTD subtype was not possible. The meta-analysis results are generalisable to the initial screening test to rule in or rule out a CTD. Our review and meta-analysis are not generalisable to the later stages in the diagnostic pathway and do not report the sensitivity and specificity of ANA tests in the diagnosis of specific diseases such as SSc, SLE, SjS, DM, PM and MCTD. There were sufficient data to conduct a meta-analysis to assess the diagnostic accuracy of ELISA (various) vs. IIF at a cut-off of 1:80, FEIA (one type) vs. IIF at a cut-off of 1:80 and 1:160 and CLIA (two types) vs. IIF at a cut-off of 1:80, though the latter analysis should be viewed with caution as more data are required to increase the precision of the estimates. The meta-analysis showed no significant difference in sensitivity or specificity between IIF at a cut-off of 1:80 and ELISA for detecting or ruling out CTD (p = 0.4 and p = 0.5, respectively). Some of the ELISA tests included in the meta-analysis were generic ANA tests that used an extract of the HEp-2 cell line: a meta-analysis limited to data for these tests also showed no significant differences compared to IIF at 1:80. We were unable to conduct a meta-analysis limited to ELISA tests without a HEp-2 component, though we note that the reported sensitivity of an ELISA without HEp-2 is more wide ranging than that of an ELISA with HEp-2 (see Figure S-2 [right and left panel, respectively]). The 95% CI around the point estimates and the prediction region around the HSROC curve indicate a large variance in results for both IIF and ELISA, particularly for specificity (Figure 1). The variance in test results may be due to differences in the underlying test population across studies, but as this variance was not seen in the FEIA HSROC (and given that the analysis included only fully paired studies), we suggest that this variance is likely to be due to differences in the conduct and interpretation of the IIF tests.
IIF and ELISA tests have the lowest positive predictive value: if the pre-test probability of CTD is 2.7%, the post-test probability of CTD after a positive test is estimated to be 7.0% for IIF at a cut-off of 1:80 and 5.5% for ELISA, compared to 25.4% for FEIA, and 14.6% for CLIA (and ~8.2% for MIA based on available data). A positive ANA screening test is followed up with additional laboratory workup that may include a second ANA test and tests for specific antibodies [1]; for FP results, this workup and its costs are unnecessary. In one study that followed up 96 patients with laboratory requests for ANA screening, it was observed that a positive HEp-2 test result generated an average of 4.11 follow-up tests [58]. Based on the average FP rate found in our analysis, this would translate into 1724 unnecessary follow-up tests per 1000 suspected patients tested with ELISA HEp-2 vs. 1280 when IIF is used as the screening test. Moreover, FP results may be misinterpreted by physicians not familiar with systemic CTDs and lead to unnecessary treatments [17,18,59].
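The follow-up burden quoted above can be approximated as the expected number of FP results per 1000 patients multiplied by the average number of follow-up tests per positive result. The sketch below is an illustration, not the authors' stated calculation: it assumes the pooled specificity estimates for ELISA (56.9%) and IIF at 1:80 (68.0%), a prevalence of 2.7%, and the 4.11 follow-up tests per positive result reported in [58]. The function name `unnecessary_follow_ups` is hypothetical.

```python
FOLLOW_UPS_PER_POSITIVE = 4.11  # average follow-up tests per positive result [58]


def unnecessary_follow_ups(specificity, prevalence=0.027, n=1000):
    """Expected follow-up tests triggered by false-positive results among n patients."""
    non_ctd = n * (1 - prevalence)                 # patients without CTD
    false_positives = (1 - specificity) * non_ctd  # expected FP results
    return false_positives * FOLLOW_UPS_PER_POSITIVE


elisa = unnecessary_follow_ups(specificity=0.569)
iif_1_80 = unnecessary_follow_ups(specificity=0.680)
print(f"ELISA: {elisa:.0f} follow-up tests per 1000 patients")
print(f"IIF 1:80: {iif_1_80:.0f} follow-up tests per 1000 patients")
```

Under these assumptions the sketch gives approximately 1724 follow-up tests for ELISA and 1280 for IIF per 1000 patients tested, matching the figures above.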
The manual IIF method and ELISA tests require multiple stages of quality control oversight, and proficient laboratory technicians. Fully automated tests can simplify the process, are less hands-on and provide more consistency across laboratories. The meta-analysis showed that FEIA and CLIA have significantly better specificity than IIF (p < 0.05), though the estimate for FEIA was supported by more data (n = 12,311 vs. n = 1981 for the CLIA analysis) and had narrower 95% CIs (FEIA [95% CI: 89.9%, 96.0%]; CLIA [95% CI: 78.3%, 91.4%]). Based on the HSROC curve for FEIA (Figure 1, middle left panel), we can be 95% confident that the 'true' FEIA specificity for CTD lies within the narrow range of values indicated by the prediction region. The 'true' specificity for the other tests is likely to vary more in practice, given the large prediction region.
A good ANA screening assay should be sensitive and exclude CTD in the case of negativity. A recent study in patients with established SLE [60] showed a variance in the frequency of ANA-negative tests across three different IIF (4.9%-22.3%) as well as an ELISA (11.7%) and multiplex bead-based assay (13.6%).
Recently, it has been proposed that combining IIF with FEIA could increase the diagnostic accuracy overall [15,44,57,59,61,62]. The current recommendations are to perform an IIF subsequent to a negative IA test (e.g. FEIA, CLIA) or an IA subsequent to a negative IIF test if there is a high clinical suspicion of a CTD [1]. Our analysis supports this recommendation (see Figure 3, post-test probability of CTD after a negative test). A previous assessment of a double-test strategy [23] based on data from four studies [41,49,55,57] showed that concordant IIF and FEIA results correctly classify 96.8% of patients, and where there is a discrepancy in the test results, a positive FEIA/negative IIF result is more likely to occur in a patient with CTD than a negative FEIA/positive IIF result (LR 2.4 vs. 1.4 [23]). The review by Bizzaro [22] which examined the diagnostic accuracy of two SPAs vs. IIF agreed that neither of the two methods alone would identify all patients with CTD and the best diagnostic strategy could be a combination of the two methods.
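For readers less familiar with likelihood ratios, the relationship between a pre-test probability, an LR and the resulting post-test probability can be written as follows. The worked values are illustrative only: they assume a pre-test probability of 2.7% (the screening prevalence used elsewhere in this paper), whereas patients with a high clinical suspicion of CTD would have a higher pre-test probability and correspondingly higher post-test probabilities.

```latex
% General relationship between pre-test probability p, LR and post-test probability
\[
\text{odds}_{\text{pre}} = \frac{p}{1-p}, \qquad
\text{odds}_{\text{post}} = \text{odds}_{\text{pre}} \times \mathrm{LR}, \qquad
p_{\text{post}} = \frac{\text{odds}_{\text{post}}}{1 + \text{odds}_{\text{post}}}
\]
% Illustration with an assumed p = 0.027, so odds_pre = 0.027 / 0.973 \approx 0.0277
\[
\mathrm{LR} = 2.4:\quad p_{\text{post}} = \frac{0.0277 \times 2.4}{1 + 0.0277 \times 2.4} \approx 0.062,
\qquad
\mathrm{LR} = 1.4:\quad p_{\text{post}} = \frac{0.0277 \times 1.4}{1 + 0.0277 \times 1.4} \approx 0.037
\]
```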
The comprehensive systematic literature review identified relevant published studies, and the quality review assessed potential bias arising from the study design, conduct and interpretation of results. The recent independent review by Bizzaro [22] identified seven studies, four of which reported diagnostic test data for CLIA and IIF [42,44,56,62] and six for FEIA and IIF [44,51,55,56,62,63]. We used four studies for the analysis of CLIA vs. IIF [42,44,45,56] and 10 for FEIA vs. IIF [40,41,44,49-51,53,55-57]. Two of the studies reported in the Bizzaro review were not used in our meta-analysis. For one of the studies this was because the IIF cut-off was neither 1:80 nor 1:160 (cut-off was 1:100) [63]. The other study was conducted after the cut-off date for our literature search [62]; however, the publication does not report diagnostic test data for FEIA and CLIA separately so would not be eligible for inclusion in our analysis.
For the analysis of FEIA vs. IIF, we included five studies where the data were from conference abstracts or posters [40,41,49,50,53] such that full details of the study methodology were not available in a full-text publication. There were too few studies to conduct a comparative meta-analysis of FEIA vs. IIF using full-text publications only. For the five studies where the data were from conference abstracts/posters [40,41,49,50,53], the average FEIA sensitivity was 75.4% and specificity 92.1%. By comparison, for the four full-text publications used in the meta-analysis of FEIA vs. IIF at a cut-off of 1:80 [44,51,56,57], the average FEIA sensitivity and specificity were higher at 78.5% and 93.4%, respectively. Similarly, for the two full-text publications used in the meta-analysis of FEIA vs. IIF at a cut-off of 1:160 [51,55], the average FEIA sensitivity was similar to the average from the conference abstracts (75.9%) but the average specificity was higher (95.1%). Therefore, whilst we were unable to conduct a thorough assessment of the study quality, inclusion of data from conference abstracts and posters is unlikely to have inflated the diagnostic test accuracy estimates for FEIA.
One alternative to the comparative meta-analysis model that we have used is to include an additional independent variable to allow data at different IIF thresholds to be included in the analysis (cf. Leuchten et al. [14]). However, for our dataset this would mean that the models comparing the different IAs vs. IIF would differ as there were not enough data for some of the tests to allow such an analysis to be performed. A minimum of four studies are required for a random-effects bivariate meta-analysis of diagnostic test accuracy data with a binary covariate for the test and separate random-effects by test.
For MIA vs. IIF there were too few studies to conduct a robust meta-analysis and we have provided hypothetical estimates for post-test probability of CTD only. The meta-analysis of CLIA vs. IIF is based on data from four studies; however, the wide 95% CIs generated for the CLIA vs. IIF meta-analysis indicate a high level of imprecision in these estimates. There are too few studies to assess whether the variance is driven by the type of CLIA. The study reporting the lowest sensitivity for CLIA (62.9%) used a CLIA test that included a HEp-2 component [45], which is unexpected given the good sensitivity reported for HEp-2 by IIF tests across the 17 studies included in this review. Further diagnostic test accuracy studies are needed to allow for a more robust analysis of CLIA vs. IIF. Whilst using fully paired data is a key advantage for the meta-analysis that we have conducted, it does not account for correlations between tests applied to the same individual.
Most of the studies used a case-control design that may be simpler to conduct compared to a prospective cross-sectional study using unselected patients referred for ANA testing in the clinic, which would be more representative of the disease in a clinical setting. The included studies had CTD cohorts with a representative range of CTD subtypes (SLE, SjS, SSc, MCTD and DM/PM). Specificity was calculated from a cohort of (non-CTD) DCs excluding healthy patients wherever feasible and studies without a representative disease control were excluded. ANA screening tests are used to support diagnosis and ANA levels can change with treatment. In eight of the 17 studies there was no information reported as to whether the sera were sampled before diagnosis or after treatment had been initiated. However, it should be noted that population selection bias would impact all test results within a study, as the analysis included only fully paired studies and the hierarchical meta-analysis grouped test data by study. For 10 studies there was no or a limited description of the reference standard used to confirm the diagnosis of a CTD (see Supplementary Material, Table S-1). There were too few studies to perform sensitivity analyses excluding studies where the patient status is unknown or reference standard is unclear. The diagnosis/classification of CTD is based on the criteria available at the time of the study and it is noted that new criteria for SLE were published in 2019 [26,27]. Based on the validation cohort, the new SLE criteria appear to have similar sensitivity to the previous criteria [3] for ruling out SLE but better specificity for ruling in SLE. As the 2019 criteria are yet to be implemented in clinical practice, a reference standard grade 'A' reflects the best quality reference standard available at the time of this review.
It was also noted that the studies used a mix of diagnostic criteria and classification criteria to define the group of patients used as cases in the study, and that diagnostic criteria may be a better entry option as it includes all available information, rather than a short list of specific criteria.
To the best of our knowledge, this is the first meta-analysis to use a bivariate statistical model to integrate diagnostic test accuracy data and to compare IIF with different IAs in the context of ANA screening as an initial step to diagnosing a CTD. A hierarchical bivariate mixed-effect model is a statistically valid method that produces unbiased estimates of the average sensitivity and specificity for each test method, as well as the expected variance around these estimates by controlling for within- and between-study differences [24]. As automated tests produce more consistent results across different laboratories compared to manual methods, our meta-analysis model included different random-effects estimators for each type of test to account for variances within each method. The use of a hierarchical model structure avoids bias from simply averaging data across studies and the bivariate model allows for correlations between sensitivity and specificity. Furthermore, the use of fully paired data allows for direct comparisons between IAs and IIF to be made [24]. Whilst it is known that IIF and ELISA methods vary, this meta-analysis quantifies the extent to which the sensitivity and specificity are likely to vary in clinical practice. The 95% prediction regions in the HSROC graphs provide a visual representation of this variance which helps with the interpretation of the performance of the different assays as initial screening tests for CTD.
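For orientation, a generic form of a bivariate model with a binary test covariate and test-specific random effects is sketched below. This is the standard textbook structure and is shown as an assumption-laden illustration; the notation (the indicator $Z_k$, the coefficients $\nu$, and the covariance matrices $\Sigma_k$) is introduced here and is not taken from the authors' model specification.

```latex
% Within-study level (study i, test k): binomial sampling variation
\[
\mathrm{TP}_{ik} \sim \mathrm{Bin}\!\left(n^{D}_{ik}, \mathrm{Se}_{ik}\right), \qquad
\mathrm{TN}_{ik} \sim \mathrm{Bin}\!\left(n^{\bar{D}}_{ik}, \mathrm{Sp}_{ik}\right)
\]
% Between-study level: Z_k = 0 for IIF and Z_k = 1 for the IA being compared;
% each test k has its own covariance matrix (separate random effects by test)
\[
\begin{pmatrix} \operatorname{logit}\mathrm{Se}_{ik} \\ \operatorname{logit}\mathrm{Sp}_{ik} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \mu_{\mathrm{Se}} + \nu_{\mathrm{Se}} Z_k \\ \mu_{\mathrm{Sp}} + \nu_{\mathrm{Sp}} Z_k \end{pmatrix},
\; \Sigma_k \right), \qquad
\Sigma_k =
\begin{pmatrix}
\sigma^{2}_{\mathrm{Se},k} & \sigma_{\mathrm{SeSp},k} \\
\sigma_{\mathrm{SeSp},k} & \sigma^{2}_{\mathrm{Sp},k}
\end{pmatrix}
\]
```

In this form, the off-diagonal term of $\Sigma_k$ captures the correlation between sensitivity and specificity across studies, and tests of $\nu_{\mathrm{Se}} = 0$ and $\nu_{\mathrm{Sp}} = 0$ correspond to comparisons of sensitivity and specificity between the two tests.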
In conclusion, this meta-analysis demonstrated that there are differences in diagnostic performance between IAs and has quantified the extent of the variation in diagnostic accuracy for IIF and ELISAs which is likely to be due to different assay setups and test outputs, as well as a lack of standardisation for interpretation of results. FEIA and, to a lesser extent, CLIA have a higher specificity and a higher LR+ than IIF whereas ELISAs are expected to have similar accuracy to IIF. A positive test result with FEIA or CLIA is therefore useful to support the diagnosis of a CTD. IIF has a higher sensitivity and a lower LR− than FEIA or CLIA. A negative IIF test, therefore, is useful to exclude a CTD. Consequently, the most favourable strategy could be to combine a highly sensitive test such as IIF with a highly specific test such as FEIA or CLIA.